
FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux (#468) #495

Open
subrata-ms wants to merge 15 commits into main from subrata-ms/CP1252Encoding

Conversation

@subrata-ms
Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, mssql_python/connection.pyR264-R271)

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py)
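The mismatch these defaults address can be reproduced in plain Python, independent of any driver (a standalone sketch; no mssql_python APIs are involved): a CP1252-encoded byte string is not valid UTF-8, so the old narrow-char default raises on decode, while text the driver has already converted to UTF-16LE decodes losslessly:

```python
# "it's" with a curly apostrophe, as CP1252 (Windows-1252) bytes -- the kind
# of raw value a Windows ODBC driver may hand back for a VARCHAR column.
raw = "it\u2019s".encode("cp1252")          # b"it\x92s"

# Old default: decode as UTF-8 -> 0x92 is not a valid UTF-8 lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# New default: fetch as SQL_C_WCHAR, so the driver converts the server
# code page to UTF-16LE itself and the client-side decode is lossless.
wide = "it\u2019s".encode("utf-16le")
decoded = wide.decode("utf-16le")

print(utf8_ok, decoded)  # False it's
```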

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h)
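The wide/narrow dispatch in ProcessChar can be sketched in Python (illustrative names only, not the actual C++ symbols): the wide path decodes UTF-16LE, the narrow path decodes the configured encoding, and a decode failure falls back to returning the raw bytes rather than None:

```python
def process_char(buf: bytes, is_wide: bool, narrow_encoding: str = "utf-8"):
    """Illustrative stand-in for the ProcessChar wide/narrow branches."""
    encoding = "utf-16le" if is_wide else narrow_encoding
    try:
        return buf.decode(encoding)
    except UnicodeDecodeError:
        # Decode failed: return raw bytes instead of silently dropping the
        # value as None (data integrity over convenience).
        return buf

print(process_char("héllo".encode("utf-16le"), is_wide=True))  # héllo
print(process_char(b"plain ascii", is_wide=False))             # plain ascii
print(process_char(b"caf\xe9", is_wide=False))                 # b'caf\xe9'
```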

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py)

@github-actions github-actions Bot added the pr-size: large Substantial code update label Apr 3, 2026
@github-actions

github-actions Bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

80%


🎯 Overall Coverage

79%


📈 Total Lines Covered: 6902 out of 8720
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/connection.py (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (83.0%): Missing lines 2066,3416-3421,3432-3443,3449-3454,3521-3522,3524-3525,3538-3541,3606-3607,3609-3610,3623-3626,5000-5002,5260-5261,5311-5313,5614-5615
  • mssql_python/pybind/ddbc_bindings.h (50.0%): Missing lines 844-853,859-863

Summary

  • Total: 331 lines
  • Missing: 66 lines
  • Coverage: 80%

mssql_python/pybind/ddbc_bindings.cpp

Lines 2062-2070

  2062                         size_t chunkBytes = DAE_CHUNK_SIZE;
  2063                         while (offset < totalBytes) {
  2064                             size_t len = std::min(chunkBytes, totalBytes - offset);
  2065 
! 2066                             rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
  2067                             if (!SQL_SUCCEEDED(rc)) {
! 2068                                 LOG("SQLExecute: SQLPutData failed for "
  2069                                     "SQL_C_CHAR chunk - offset=%zu, total=%zu, len=%zu, rc=%d",
  2070                                     offset, totalBytes, len, rc);

Lines 3412-3425

  3412                                     "length=%lu",
  3413                                     i, (unsigned long)numCharsInData);
  3414                             } else {
  3415                                 // Buffer too small, fallback to streaming
! 3416                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3417                                     "truncated, using streaming LOB",
! 3418                                     i);
! 3419                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3420                                                               "utf-16le"));
! 3421                             }
  3422                         } else if (dataLen == SQL_NULL_DATA) {
  3423                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3424                             row.append(py::none());
  3425                         } else if (dataLen == 0) {

Lines 3428-3447

  3428                             // Driver cannot report total length up front; this is
  3429                             // NOT a NULL value. Fall back to streaming via
  3430                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3431                             // we don't silently lose data.
! 3432                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (CHAR via WCHAR), "
! 3433                                 "streaming via FetchLobColumnData",
! 3434                                 i);
! 3435                             row.append(
! 3436                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
! 3437                         } else if (dataLen < 0) {
! 3438                             LOG("SQLGetData: Unexpected negative data length "
! 3439                                 "for column %d - dataType=%d, dataLen=%ld",
! 3440                                 i, dataType, (long)dataLen);
! 3441                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3442                                               "data length");
! 3443                         }
  3444                     } else {
  3445                         // Surface driver errors instead of silently returning NULL.
  3446                         // Returning py::none() here would be indistinguishable from
  3447                         // a genuine SQL NULL value to the Python caller and is a

Lines 3445-3458

  3445                         // Surface driver errors instead of silently returning NULL.
  3446                         // Returning py::none() here would be indistinguishable from
  3447                         // a genuine SQL NULL value to the Python caller and is a
  3448                         // data-integrity risk.
! 3449                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3450                                   "(CHAR via WCHAR) - SQLRETURN=%d",
! 3451                                   i, ret);
! 3452                         ThrowStdException("SQLGetData failed for CHAR/VARCHAR column "
! 3453                                           "fetched as SQL_C_WCHAR");
! 3454                     }
  3455                 } else {
  3456                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3457                     //
  3458                     // Why this differs from SQLBindCols / FetchBatchData:

Lines 3517-3529

  3517                             // Driver cannot report total length up front; this is
  3518                             // NOT a NULL value. Fall back to streaming via
  3519                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3520                             // we don't silently lose data.
! 3521                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (SQL_CHAR), "
! 3522                                 "streaming via FetchLobColumnData",
  3523                                 i);
! 3524                             row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3525                                                           effectiveCharEnc));
  3526                         } else if (dataLen < 0) {
  3527                             LOG("SQLGetData: Unexpected negative data length "
  3528                                 "for column %d - dataType=%d, dataLen=%ld",
  3529                                 i, dataType, (long)dataLen);

Lines 3534-3545

  3534                         // Surface driver errors instead of silently returning NULL.
  3535                         // Returning py::none() here would be indistinguishable from
  3536                         // a genuine SQL NULL value to the Python caller and is a
  3537                         // data-integrity risk.
! 3538                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3539                                   "(SQL_CHAR) - SQLRETURN=%d",
! 3540                                   i, ret);
! 3541                         ThrowStdException("SQLGetData failed for SQL_CHAR/VARCHAR column");
  3542                     }
  3543                 }
  3544                 break;
  3545             }

Lines 3602-3614

  3602                             // Driver cannot report total length up front; this is
  3603                             // NOT a NULL value. Fall back to streaming via
  3604                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3605                             // we don't silently lose data.
! 3606                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (NVARCHAR), "
! 3607                                 "streaming via FetchLobColumnData",
  3608                                 i);
! 3609                             row.append(
! 3610                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
  3611                         } else if (dataLen < 0) {
  3612                             LOG("SQLGetData: Unexpected negative data length "
  3613                                 "for column %d (NVARCHAR) - dataLen=%ld",
  3614                                 i, (long)dataLen);

Lines 3619-3630

  3619                         // Surface driver errors instead of silently returning NULL.
  3620                         // Returning py::none() here would be indistinguishable from
  3621                         // a genuine SQL NULL value to the Python caller and is a
  3622                         // data-integrity risk.
! 3623                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3624                                   "(NVARCHAR) - SQLRETURN=%d",
! 3625                                   i, ret);
! 3626                         ThrowStdException("SQLGetData failed for NVARCHAR column");
  3627                     }
  3628                 }
  3629                 break;
  3630             }

Lines 4996-5006

  4996                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4997                 break;
  4998             default:
  4999                 std::ostringstream errorString;
! 5000                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 5001                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 5002                             << (i + 1);
  5003                 LOG(errorString.str().c_str());
  5004                 ThrowStdException(errorString.str());
  5005                 break;
  5006         }

Lines 5256-5265

  5256                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5257                                                  sizeof(DateTimeOffset),
  5258                                                  buffers.indicators[idxCol].data());
  5259                             if (!SQL_SUCCEEDED(ret)) {
! 5260                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5261                                     idxCol + 1);
  5262                                 return ret;
  5263                             }
  5264                             break;
  5265                         }

Lines 5307-5317

  5307                     nullCounts[idxCol] += 1;
  5308                     continue;
  5309                 } else if (indicator < 0) {
  5310                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5311                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5312                         "Length - %lld",
! 5313                         idxCol + 1, dataType, (long long)indicator);
  5314                     ThrowStdException("Unexpected negative data length.");
  5315                 }
  5316                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5610-5619

  5610         arrowSchemaBatchCapsule =
  5611             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5612                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5613                 if (arrowSchema->release) {
! 5614                     arrowSchema->release(arrowSchema);
! 5615                 }
  5616                 delete arrowSchema;
  5617             });
  5618     } catch (...) {
  5619         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  840                 // Decode failed — fall back to returning the raw UTF-16LE bytes
  841                 // (consistent with the narrow-char path below and with
  842                 // FetchLobColumnData / SQLGetData_wrap which return raw bytes on
  843                 // decode failure instead of silently dropping data as None).
! 844                 PyErr_Clear();
! 845                 PyObject* pyBytes = PyBytes_FromStringAndSize(
! 846                     reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
! 847                 if (pyBytes) {
! 848                     PyList_SET_ITEM(row, col - 1, pyBytes);
! 849                 } else {
! 850                     PyErr_Clear();
! 851                     Py_INCREF(Py_None);
! 852                     PyList_SET_ITEM(row, col - 1, Py_None);
! 853                 }
  854             } else {
  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {

  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {
  858             // LOB / truncated: stream with SQL_C_WCHAR
! 859             PyList_SET_ITEM(row, col - 1,
! 860                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 861                                 .release()
! 862                                 .ptr());
! 863         }
  864         return;
  865     }
  866 
  867     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage:
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 66.5%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.3%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%


@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10
Contributor

Copilot AI left a comment


Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • mssql_python/connection.py: Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
  • mssql_python/cursor.py: Passes updated decoding settings (encoding + ctype) into the native fetch functions.
  • mssql_python/pybind/ddbc_bindings.cpp: Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
  • mssql_python/pybind/ddbc_bindings.h: Extends per-column metadata and adds a wide-char branch in ProcessChar.
  • tests/test_013_encoding_decoding.py: Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.


Contributor

@sumitmsft sumitmsft left a comment


Left some minor comments. Otherwise, I am ok with this change

jahnvi480
jahnvi480 previously approved these changes May 11, 2026
Comment on lines 6071 to 6078
py::arg("StatementHandle"), py::arg("row"), py::arg("charEncoding") = "utf-8",
py::arg("wcharEncoding") = "utf-16le");
py::arg("wcharEncoding") = "utf-16le", py::arg("charCtype") = SQL_C_WCHAR);
m.def("DDBCSQLFetchMany", &FetchMany_wrap, py::arg("StatementHandle"), py::arg("rows"),
py::arg("fetchSize"), py::arg("charEncoding") = "utf-8",
py::arg("wcharEncoding") = "utf-16le", "Fetch many rows from the result set");
py::arg("wcharEncoding") = "utf-16le", py::arg("charCtype") = SQL_C_WCHAR,
"Fetch many rows from the result set");
m.def("DDBCSQLFetchAll", &FetchAll_wrap, "Fetch all rows from the result set",
py::arg("StatementHandle"), py::arg("rows"), py::arg("charEncoding") = "utf-8",
Contributor

@gargsaumya gargsaumya May 11, 2026


The pybind11 default for charEncoding is still "utf-8" here, but this PR changed the connection-level default to "utf-16le" everywhere else (Connection.__init__, setdecoding(), all three cursor fallbacks in cursor.py). I think these three defaults should be "utf-16le" to stay consistent with the rest of the PR. Could you verify this once?

Contributor Author


Addressed in the latest commit. Thanks for bringing this to my attention.

bewithgaurav
bewithgaurav previously approved these changes May 12, 2026
Collaborator

@bewithgaurav bewithgaurav left a comment


approving with test addition suggestions below.

logic lgtm. fix for #468 reproduces cleanly against sql2022, new TestIssue531Utf8CollationVarchar covers the SQL_C_WCHAR-default path on linux/macOS, all 185 encoding tests pass.

one ask before merge: the test coverage can be improved.

I went through each missing line. four python tests would close most of the gap:

  1. fetchone with NULL + empty VARCHAR. covers null/empty arms in the new CHAR-via-WCHAR block.
cursor.execute("SELECT CAST(NULL AS VARCHAR(10)), CAST('' AS VARCHAR(10))")
row = cursor.fetchone()
assert row[0] is None and row[1] == ''
  2. fetchone with VARCHAR(MAX) and NVARCHAR(MAX) carrying 50k+ non-ASCII. exercises wide-char LOB streaming in ProcessChar and SQLGetData_wrap. large LOBs also surface SQL_NO_TOTAL on the first chunk, picks up that branch incidentally.
payload = "caf \u00ad " * 6250  # ~50k
cursor.execute("CREATE TABLE #t (v VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8, n NVARCHAR(MAX))")
cursor.execute("INSERT INTO #t VALUES (?, ?)", payload, payload)
cursor.execute("SELECT v, n FROM #t")
assert cursor.fetchone() == (payload, payload)
  3. fetchone on a TEXT column. legacy LOB, deterministically returns SQL_NO_TOTAL on the first chunk. cheapest way to actually exercise that branch.
cursor.execute("CREATE TABLE #t (v TEXT)")
cursor.execute("INSERT INTO #t VALUES (?)", "A" * 100_000)
cursor.execute("SELECT v FROM #t")
assert cursor.fetchone()[0] == "A" * 100_000
  4. fetchall on VARCHAR(N) with UTF-8 collation at size boundary. catches off-by-one in the wide-char path with non-BMP chars.
cursor.execute("CREATE TABLE #t (v VARCHAR(4) COLLATE Latin1_General_100_CI_AS_SC_UTF8)")
cursor.execute("INSERT INTO #t VALUES (N'\U0001F600')")
assert cursor.fetchall()[0][0] == '\U0001F600'

the decode-failure-to-bytes branch and the throw paths can't easily be hit from python. fine as a follow-up, or at minimum mark them as defensive in a code comment so a future refactor doesn't quietly strip them.
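The non-BMP boundary case in suggestion 4 matters because a character such as U+1F600 is one Python character but two UTF-16 code units (a surrogate pair), so sizing a wide-char buffer in characters rather than code units can clip it at a column-size boundary (a standalone sketch):

```python
ch = "\U0001F600"              # emoji, outside the Basic Multilingual Plane
utf16 = ch.encode("utf-16le")
code_units = len(utf16) // 2   # each UTF-16 code unit is 2 bytes

print(len(ch), code_units)     # 1 character, 2 code units
```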

@subrata-ms
Contributor Author

Hi @bewithgaurav, these suggested tests do not improve code coverage (~1%). The lines flagged as uncovered are hard to reach with the current test framework. One option is to add dummy/test code inside the production code to create these scenarios; I tried that, and it turned into a fairly large change to the current code structure.

@subrata-ms subrata-ms closed this May 12, 2026
@subrata-ms subrata-ms reopened this May 12, 2026
@yan-hic

yan-hic commented May 12, 2026

I hit the same error using cursor.arrow() and .arrow_reader().

Can you add tests for these cases too? The conn.setdecoding(_SQL_CHAR, "cp1252") workaround only affects DBAPI fetching.


Labels

pr-size: large Substantial code update


9 participants